A. Project Description

Objectives

  • Implement an easily generalisable fraud-detection algorithm that can handle different data sources.
  • Use a supervised learning approach: Multilayer Perceptron (MLP).

Challenge description

In this notebook, we will implement and test a variety of Deep Learning algorithms to detect fraud in credit card transactions. We base our study on a Kaggle dataset: Credit Card Fraud Detection.

Dataset description

All credit card transactions were recorded over 2 days in September 2013. This dataset is highly unbalanced: 492 frauds out of 284,807 transactions (0.172% of transactions).

The dataset contains 30 feature variables in total: 28 that are the result of a PCA transformation, plus the time and the amount of the transaction (along with the class label). There is no prior knowledge about the original features, due to confidentiality.


  • Features V1, ..., V28: principal components from the PCA transformation.
  • Time: seconds elapsed between each transaction and the first transaction in the dataset.
  • Amount: transaction amount (unknown currency).
  • Class: label of the transaction; 1 if it is a fraud, 0 otherwise.

Error metric


Accuracy alone is not enough to measure the performance of an anomaly-detection algorithm on a highly unbalanced dataset. Precision, also known as positive predictive value, and recall, also known as sensitivity, are more useful metrics in this case, and help avoid misinterpreting the algorithm's performance.

Furthermore, the metric recommended by Kaggle for this dataset is the Area Under the Precision-Recall Curve (AUPRC). This metric, as well as the F1 score, is used in this study to compare the different algorithms.
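To illustrate why accuracy is misleading here, consider a small toy example (the class counts below are made up for illustration; they are not from the Kaggle dataset). A classifier that predicts "normal" for every transaction gets a high accuracy while catching no fraud at all, whereas precision and recall expose the difference:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy imbalanced dataset: 1000 transactions, of which 5 are frauds
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# A useless classifier that predicts "normal" for everything
y_all_normal = np.zeros(1000, dtype=int)
print(accuracy_score(y_true, y_all_normal))   # 0.995: looks great, detects nothing
print(recall_score(y_true, y_all_normal))     # 0.0: no fraud is caught

# A classifier that catches 4 of the 5 frauds at the cost of 10 false alarms
y_pred = np.zeros(1000, dtype=int)
y_pred[:4] = 1          # 4 true positives
y_pred[5:15] = 1        # 10 false positives
print(precision_score(y_true, y_pred))  # 4 / 14 ≈ 0.286
print(recall_score(y_true, y_pred))     # 4 / 5 = 0.8
```

The second classifier has *lower* accuracy than the trivial one (0.989 vs. 0.995), yet it is clearly the more useful fraud detector, which is exactly what recall and precision capture.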

Libraries to use in this project


  • Axiolib: Axionable's home-made library.
  • Pandas: a library providing high-performance, easy-to-use data structures and data analysis tools for Python.
  • Numpy: a library for scientific computing with Python.
  • Matplotlib: a Python 2D plotting library.
  • Tensorflow: an open-source software library for numerical computation using data flow graphs, commonly used to implement neural networks.
  • Scikit-learn: a Python library for machine learning.
  • Seaborn: a library for statistical plot representations.
In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns

# Plot libraries
from pylab import rcParams
import matplotlib.pyplot as plt

# Machine Learning tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, f1_score, average_precision_score
from sklearn.metrics import recall_score, precision_score

# Keras framework: Deep learning
from keras import backend as K
from keras.models import Model, load_model
from keras.layers import Input, Dense, Lambda, Dropout
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from keras import regularizers, metrics

from scipy.stats import norm

# Import axiolib functions
import axiolib.plot as axioplot
from axiolib.plot import histogram_from_dataframe
axioplot.init_notebook_mode()

%matplotlib inline

# Setting seaborn configuration for the figures
sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 14, 8

random_seed = 42
labels = ['Normal','Fraud']
Using TensorFlow backend.

B. Data exploration

NOTE: A data-cleaning process was not necessary for this dataset, because the anonymized variables are the result of a dimensionality-reduction process (i.e. PCA), and there are no missing values in the time and amount variables.

Download data from Kaggle


You can download the dataset from the Kaggle website (Credit Card Fraud Detection) and place the file in the same directory as the notebook (here it is read from an S3 bucket).

In [4]:
data = pd.read_csv("s3://axiods/detection_de_fraude/creditcard.csv")
data.head()
Out[4]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

Getting some data statistics


We will explore the characteristics of each feature (i.e. a feature description). This tells us whether the data is already standardized (zero mean and unit standard deviation).

In [5]:
data.describe()
Out[5]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
count 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 ... 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 284807.000000 284807.000000
mean 94813.859575 1.165980e-15 3.416908e-16 -1.373150e-15 2.086869e-15 9.604066e-16 1.490107e-15 -5.556467e-16 1.177556e-16 -2.406455e-15 ... 1.656562e-16 -3.444850e-16 2.578648e-16 4.471968e-15 5.340915e-16 1.687098e-15 -3.666453e-16 -1.220404e-16 88.349619 0.001727
std 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00 1.415869e+00 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00 1.098632e+00 ... 7.345240e-01 7.257016e-01 6.244603e-01 6.056471e-01 5.212781e-01 4.822270e-01 4.036325e-01 3.300833e-01 250.120109 0.041527
min 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00 -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01 ... -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00 -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01 0.000000 0.000000
25% 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01 -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01 ... -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01 -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02 5.600000 0.000000
50% 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01 -1.984653e-02 -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02 -5.142873e-02 ... -2.945017e-02 6.781943e-03 -1.119293e-02 4.097606e-02 1.659350e-02 -5.213911e-02 1.342146e-03 1.124383e-02 22.000000 0.000000
75% 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00 7.433413e-01 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01 5.971390e-01 ... 1.863772e-01 5.285536e-01 1.476421e-01 4.395266e-01 3.507156e-01 2.409522e-01 9.104512e-02 7.827995e-02 77.165000 0.000000
max 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00 1.687534e+01 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01 1.559499e+01 ... 2.720284e+01 1.050309e+01 2.252841e+01 4.584549e+00 7.519589e+00 3.517346e+00 3.161220e+01 3.384781e+01 25691.160000 1.000000

8 rows × 31 columns

From the stats of the DataFrame we can say that:
  • The 28 variables coming from the PCA transformation are roughly in the same order of magnitude. They are almost centered: the means are close to `zero`, but the standard deviations differ from `one`.
  • The data is not normalized.
  • A box plot of each variable could give us a better idea of the data variability. However, due to the large number of datapoints, sub-sampling would be needed.

Checking for null or missing values in the dataset:

In [6]:
# To check if there is any missing value in the dataframe
data.isnull().sum()
Out[6]:
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
There are NO missing values in the dataset!

Analysis of the time and amount variables


We will analyze the only two known variables: time and amount. The time could be useful to see whether fraudulent transactions show any frequency pattern (repetition). The amount, on the other hand, could give us an idea of the typical amount of fraudulent transactions.

Here we will search for data patterns such as:

  • Feature statistic analysis: mean, standard deviation, min, and max.
  • Frequent fraudulent behaviour: is there any pattern for fraudulent and non-fraudulent behaviours? How well does the data represent reality?
  • Are these variables necessary for modeling fraud/non-fraud credit card transactions?

Looking at the distribution of the time feature:

In [7]:
# Analysing the 'TIME' feature of the dataset
print("Fraud transactions:")
print(data.Time[data.Class==1].describe())
print("Normal transactions:")
print(data.Time[data.Class==0].describe())
Fraud transactions:
count       492.000000
mean      80746.806911
std       47835.365138
min         406.000000
25%       41241.500000
50%       75568.500000
75%      128483.000000
max      170348.000000
Name: Time, dtype: float64
Normal transactions:
count    284315.000000
mean      94838.202258
std       47484.015786
min           0.000000
25%       54230.000000
50%       84711.000000
75%      139333.000000
max      172792.000000
Name: Time, dtype: float64
  • We think that `time` does not add any information to predict fraud, as all observations are independent.
  • We confirm that the data is highly unbalanced.
  • We confirm that there are two days of data in the samples: `48(hours) x 60(minutes) x 60(seconds) ~ 172 800s`.
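The two-day figure in the last bullet can be verified with a one-line sanity check (the value 172,792 quoted below is the maximum of the `Time` column from the `describe()` output above):

```python
# Sanity check: 48 hours expressed in seconds
seconds_in_two_days = 48 * 60 * 60
print(seconds_in_two_days)  # 172800, consistent with max(Time) = 172792 in the dataset
```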

Looking at the distribution of the amount feature:

In [8]:
# Analysing the 'AMOUNT' feature of the dataset
print("Fraud transactions:")
print(data.Amount[data.Class==1].describe())
print("Normal transactions:")
print(data.Amount[data.Class==0].describe())
Fraud transactions:
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
Normal transactions:
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64
  • Half of the fraud transactions (50%) do not involve a large amount, i.e. less than 10, and the majority of them (75%) involve an amount below 106.
  • We think that the `amount` variable could help us detect anomalous behaviours.

Plot fraud/non-fraud behaviour hour by hour

In [9]:
# We need to transform the TIME variable from seconds to hours
data["hour"] = data.Time.map(lambda x: np.ceil(x / 3600))
In [10]:
axioplot.histogram_from_dataframe(data[data.Class==1],
                                  label="hour",
                                  maxnbinsx=50,
                                  xaxis_title="Fraudulent transactions",
                                  yaxis_title="Nb transactions by hour")

axioplot.histogram_from_dataframe(data[data.Class==0],
                                  label="hour",
                                  maxnbinsx=50,
                                  xaxis_title="Normal transactions",
                                  yaxis_title="Nb transactions by hour")
If we assume that the data recording started at the beginning of a day, i.e. `time = 0` means `time = 12am`, we can conclude from the last plots:
  • Credit card transactions are mostly made during working hours, which confirms that the data reflects reality.
  • For fraud transactions, there is a peak at 12h but not at 36h; there is no periodicity in this behaviour.
  • It is difficult to draw more insights because we do not know whether the dataset covers weekdays or weekend days.
  • For a better understanding of future datasets, it is important to know which days were recorded, or to have more recorded days.

C. Implementing an Undercomplete Autoencoder

An autoencoder is a neural network for unsupervised learning. The advantages of unsupervised learning are to:

  • Automatically extract meaningful features from the data.
  • Leverage the availability of unlabeled data.
  • Add a data-dependent regularizer to training.

Common Loss function to minimize: Squared Error

NOTE: Other neural networks used in Unsupervised Learning are: Restricted Boltzmann Machines and Sparse Coding Models.

An `Autoencoder` is a neural network architecture composed of an encoder and a decoder. The goal of an autoencoder is to copy its input to its output through a reconstruction process. The encoder maps the input into a hidden-layer space known as the code, and the decoder reconstructs the input from the code.

An `Undercomplete Autoencoder` uses a dimensionality-reduction mechanism similar to PCA: the code has fewer dimensions than the input.
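The PCA analogy can be made concrete with a minimal numpy sketch (on synthetic standardized data, not the Kaggle dataset): a linear "encoder" projects 29 features down to a 7-dimensional code, a linear "decoder" projects back, and the per-sample mean squared reconstruction error is exactly the anomaly score used later in this notebook, except that the autoencoder learns a nonlinear version of these two maps.

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(1000, 29)  # synthetic stand-in for the standardized features

# PCA as a linear "undercomplete autoencoder": encode to 7 dims, decode back
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
code_dim = 7
W = Vt[:code_dim].T               # (29, 7) projection = linear "encoder"

code = X_centered @ W             # encoder: 29 -> 7
X_rec = code @ W.T                # decoder: 7 -> 29

# Per-sample reconstruction error, used as the anomaly score
mse = np.mean((X_centered - X_rec) ** 2, axis=1)
print(mse.shape)  # (1000,)
```

Samples that do not fit the subspace spanned by the code (here, the top principal components) get a high `mse`; the trained autoencoder applies the same logic to transactions that do not resemble the normal ones it was trained on.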

1. Preparing the data:

  • Dropping the 'Time' and 'hour' columns from the dataset
  • Standardize features by removing the mean and scaling to unit variance
  • Dividing the dataset in X and Y groups. X has all the features except Class, and Y has only the Class column.
NOTE: The labels are not necessary for the training of the Autoencoder. However, they will be used in a post-processing step to analyse the detection of fraudulent transactions considering the `reconstruction error`.
In [11]:
# Dropping 'Time'
X = data.drop(['Time','Class','hour'],axis=1)
Y = data.Class

# Standardising features
X = StandardScaler().fit_transform(X)

X_normal = X[data.Class == 0]
X_fraud  = X[data.Class == 1]

Y_normal = Y[data.Class == 0]
Y_fraud = Y[data.Class == 1]

print("X shape:",np.shape(X))
print("Y shape:",np.shape(Y))
print("X_normal shape:",np.shape(X_normal))
print("X_fraud shape:",np.shape(X_fraud))
X shape: (284807, 29)
Y shape: (284807,)
X_normal shape: (284315, 29)
X_fraud shape: (492, 29)

The dataset is split into three different sets: train, validation, and test:

  • The training set is used to learn the parameters (weights) of the MLP.
  • The validation set is used to adjust some hyperparameters and to save the model that generalizes best during training.
  • The test set is used to confirm that the trained model generalizes well to new datapoints.
NOTE: The training set only contains normal transactions. This is crucial for the fraud-detection process: since the training set contains only normal transactions, the Autoencoder learns to reconstruct them. For a fraudulent transaction, the reconstruction error will be higher, because the Autoencoder never learned to reproduce this type of transaction.

The test set contains normal transactions and all the fraudulent transactions of the dataset.

In [12]:
X_normal_train, X_normal_test, Y_normal_train, Y_normal_test = train_test_split(X_normal,Y_normal,test_size=0.2)
X_fraud_test, X_fraud_val, Y_fraud_test, Y_fraud_val = train_test_split(X_fraud,Y_fraud,test_size=0.5)

# Only normal cases for the VAL SET
X_train, X_val = train_test_split(X_normal_train,test_size=0.125)

#X_test = np.concatenate([X_normal_test,X_fraud_test],axis=0)
X_test = np.concatenate([X_normal_test,X_fraud],axis=0)
Y_test = np.concatenate([Y_normal_test,Y_fraud],axis=0)

print("X fraud test shape:",np.shape(X_fraud_test))
print("X normal test shape:",np.shape(X_normal_test))
print("X train shape:",np.shape(X_train))
print("X val shape:",np.shape(X_val))
print("X test shape",np.shape(X_test))
X fraud test shape: (246, 29)
X normal test shape: (56863, 29)
X train shape: (199020, 29)
X val shape: (28432, 29)
X test shape (57355, 29)

2. Building the model

The autoencoder is composed of an encoder and a decoder, both based on a Multi-Layer Perceptron architecture with the same number of hidden layers.

We use an L1 activity regularizer in the first encoder layer, and a Dropout layer at the end of the encoder.

In [13]:
input_dim = X_train.shape[1]
encoder_l1 = int(input_dim*0.5)
encoder_l2 = int(encoder_l1*0.5)

decoder_l1 = int(encoder_l2*2)
decoder_l2 = input_dim

dropout_prob = 0.2 # Fraction of units to drop (Keras Dropout rate)
dropout_seed = 10

NOTE: The order of the activation functions in the decoder is not the same as in the encoder.

In [14]:
def get_autoencoder_model():
    # Building the model
    inputs = Input(shape=(input_dim,))

    # ENCODER layers
    encoder = Dense(units=encoder_l1, activation="tanh",
                    activity_regularizer=regularizers.l1(10e-5))(inputs)
    encoder = Dense(units=encoder_l2, activation="relu")(encoder)
    encoder = Dropout(dropout_prob,seed=dropout_seed)(encoder)

    # DECODER layers
    decoder = Dense(units=decoder_l1, activation="tanh")(encoder)
    decoder = Dense(units=decoder_l2, activation="relu")(decoder)

    # Defining the AUTOENCODER
    autoencoder = Model(inputs=inputs, outputs=decoder)

    # Compiling the model
    autoencoder.compile(optimizer='Adam',loss='mean_squared_error',
                       metrics=['accuracy'])
    
    return autoencoder

def get_autoencoder_model1():
    # Building the model
    inputs = Input(shape=(input_dim,))

    # ENCODER layers
    encoder = Dense(units=14, activation="tanh",
                    activity_regularizer=regularizers.l1(10e-5))(inputs)
    encoder = Dense(units=14, activation="relu")(encoder)
    encoder = Dense(units=7, activation="relu")(encoder)
    encoder = Dropout(dropout_prob,seed=dropout_seed)(encoder)

    # DECODER layers
    decoder = Dense(units=7, activation="tanh")(encoder)
    decoder = Dense(units=14, activation="relu")(decoder)
    decoder = Dense(units=29, activation="relu")(decoder)

    # Defining the AUTOENCODER
    autoencoder = Model(inputs=inputs, outputs=decoder)

    # Compiling the model
    autoencoder.compile(optimizer='Adam',loss='mean_squared_error',
                       metrics=['accuracy'])
    
    return autoencoder

There are two different models for the Undercomplete Autoencoder. The first has 2 hidden layers in the encoder and 1 hidden layer in the decoder. The second model has one more hidden layer in both the encoder and the decoder.

In [17]:
# TRAINING process
nb_epochs = 500
batch_size = 2048

autoencoder  = get_autoencoder_model()
autoencoder1 = get_autoencoder_model1()

print("="*50)
print("FIRST MODEL TRAINING")
print("="*50)

# Autoencoder: Defining checkpoints
bestModelFile = 'autoencoder_5.h5'
checkpoint = ModelCheckpoint(filepath=bestModelFile,verbose=0,
                             monitor='val_loss',mode='min',
                             save_best_only=True)
reduce_LR  = ReduceLROnPlateau(monitor='val_loss',factor=0.5,
                               patience=10,verbose=False)
early_stop = EarlyStopping(monitor='val_loss',patience=50,verbose=True)

callbacks = [checkpoint, reduce_LR, early_stop]

history = autoencoder.fit(X_train,X_train,epochs=nb_epochs,
                          batch_size=batch_size,shuffle=True,
                          validation_data=(X_val,X_val),verbose=0,
                          callbacks=callbacks)

print("="*50)
print("SECOND MODEL TRAINING")
print("="*50)

# Autoencoder1: Defining checkpoints
bestModelFile1 = 'autoencoder_6.h5'
checkpoint = ModelCheckpoint(filepath=bestModelFile1,verbose=0,
                             monitor='val_loss',mode='min',
                             save_best_only=True)

callbacks = [checkpoint, reduce_LR, early_stop]

history1 = autoencoder1.fit(X_train,X_train,epochs=nb_epochs,
                          batch_size=batch_size,shuffle=True,
                          validation_data=(X_val,X_val),verbose=0,
                          callbacks=callbacks)

==================================================
FIRST MODEL TRAINING
==================================================
Epoch 00298: early stopping
==================================================
SECOND MODEL TRAINING
==================================================
Epoch 00305: early stopping

Working with the best model only:

First Autoencoder model

In [18]:
autoencoder = load_model(bestModelFile)
autoencoder.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_5 (InputLayer)         (None, 29)                0         
_________________________________________________________________
dense_21 (Dense)             (None, 14)                420       
_________________________________________________________________
dense_22 (Dense)             (None, 7)                 105       
_________________________________________________________________
dropout_5 (Dropout)          (None, 7)                 0         
_________________________________________________________________
dense_23 (Dense)             (None, 14)                112       
_________________________________________________________________
dense_24 (Dense)             (None, 29)                435       
=================================================================
Total params: 1,072
Trainable params: 1,072
Non-trainable params: 0
_________________________________________________________________

Second Autoencoder model

In [19]:
autoencoder1 = load_model(bestModelFile1)
autoencoder1.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_6 (InputLayer)         (None, 29)                0         
_________________________________________________________________
dense_25 (Dense)             (None, 14)                420       
_________________________________________________________________
dense_26 (Dense)             (None, 14)                210       
_________________________________________________________________
dense_27 (Dense)             (None, 7)                 105       
_________________________________________________________________
dropout_6 (Dropout)          (None, 7)                 0         
_________________________________________________________________
dense_28 (Dense)             (None, 7)                 56        
_________________________________________________________________
dense_29 (Dense)             (None, 14)                112       
_________________________________________________________________
dense_30 (Dense)             (None, 29)                435       
=================================================================
Total params: 1,338
Trainable params: 1,338
Non-trainable params: 0
_________________________________________________________________

3. Plotting the history of metrics:

In [20]:
plt.subplot(2,1,1)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('loss')
plt.legend(['train','val'],loc='upper right')
plt.title("First Model")
plt.subplot(2,1,2)
plt.plot(history1.history['loss'])
plt.plot(history1.history['val_loss'])
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','val'],loc='upper right')
plt.title("Second Model")

print("-"*30)
print("FIRST MODEL")
print("Min val_loss:",np.min(history.history['val_loss']))
print("-"*30)
print("SECOND MODEL")
print("Min val_loss:",np.min(history1.history['val_loss']))
------------------------------
FIRST MODEL
Min val_loss: 0.7625946971762362
------------------------------
SECOND MODEL
Min val_loss: 0.7753601543475351

We can observe that the reconstruction errors for both the training and validation sets converge well before the 500-epoch limit (early stopping triggered around epoch 300). In the following, we only work with the model that generalizes best on the validation set.

4. Predicting fraudulent transactions in the test set

First, we look at the histogram of the reconstruction error for the train and test sets. This gives us a better idea of the threshold to use to distinguish normal from fraudulent transactions.
Furthermore, we can check whether the model reconstructs the training set well.

In [21]:
X_train_pred = autoencoder.predict(X_train)
X_test_pred  = autoencoder.predict(X_test) 

train_mse = np.mean(np.power(X_train - X_train_pred, 2), axis =1)
test_mse = np.mean(np.power(X_test - X_test_pred, 2), axis =1)

max_mse = 100

f, ax = plt.subplots(3,1,sharex=True)
ax[0].hist(train_mse[(train_mse < max_mse)],bins=20)
ax[0].set_yscale('log')
ax[0].set_ylabel('Nb transactions')
ax[0].set_title('Train - Normal transactions')
ax[1].hist(test_mse[(Y_test==0) & (test_mse < max_mse)],bins=20)
ax[1].set_yscale('log')
ax[1].set_ylabel('Nb transactions')
ax[1].set_title('Test - Normal transactions')
ax[2].hist(test_mse[(Y_test==1) & (test_mse < max_mse)],bins=20)
ax[2].set_ylabel('Nb transactions')
ax[2].set_title('Test - Fraud transactions')
ax[2].set_xlabel('Mean Squared Error (MSE)')

print("-"*30)
print("FIRST MODEL")
print("-"*30)
print("Nb samples in test:",len(test_mse))
print("Nb samples with error MSE less than 10:",np.shape(test_mse[(Y_test==0) & (test_mse<10)]))
print("Nb samples of Fraud transactions:",np.sum(Y_test))
print("Max Error in the test set:",np.max(test_mse))
------------------------------
FIRST MODEL
------------------------------
Nb samples in test: 57355
Nb samples with error MSE less than 10: (56651,)
Nb samples of Fraud transactions: 492
Max Error in the test set: 540.1832023709817
In [22]:
X_train_pred1 = autoencoder1.predict(X_train)
X_test_pred1  = autoencoder1.predict(X_test) 

train_mse1 = np.mean(np.power(X_train - X_train_pred1, 2), axis =1)
test_mse1 = np.mean(np.power(X_test - X_test_pred1, 2), axis =1)

f, ax = plt.subplots(3,1,sharex=True)
ax[0].hist(train_mse1[(train_mse1<max_mse)],bins=20)
ax[0].set_yscale('log')
ax[0].set_ylabel('Nb transactions')
ax[0].set_title('Train - Normal transactions')
ax[1].hist(test_mse1[(Y_test==0) & (test_mse1 < max_mse)],bins=20)
ax[1].set_yscale('log')
ax[1].set_ylabel('Nb transactions')
ax[1].set_title('Test - Normal transactions')
ax[2].hist(test_mse1[(Y_test==1) & (test_mse1 < max_mse)],bins=20)
ax[2].set_ylabel('Nb transactions')
ax[2].set_title('Test - Fraud transactions')
ax[2].set_xlabel('Mean Squared Error (MSE)')

print("-"*30)
print("SECOND MODEL")
print("-"*30)
print("Nb samples in test:",len(test_mse1))
print("Nb samples with error MSE less than 10:",np.shape(test_mse1[(Y_test==0) & (test_mse1<10)]))
print("Nb samples of Fraud transactions:",np.sum(Y_test))
print("Max Error in the test set:",np.max(test_mse1))
------------------------------
SECOND MODEL
------------------------------
Nb samples in test: 57355
Nb samples with error MSE less than 10: (56628,)
Nb samples of Fraud transactions: 492
Max Error in the test set: 549.980996965085
From the last two figures we can observe:
  • The first model has fewer fraudulent transactions with a small reconstruction error.
  • In both figures, the number of normal transactions in the training and test sets decreases as the reconstruction error increases.
  • For a reconstruction error above 2, the number of fraudulent transactions increases, but the number of normal transactions remains higher. We need to find a trade-off between detecting more fraudulent transactions and producing fewer false positives (i.e. labeling a normal transaction as fraudulent).
In [23]:
# Setting the threshold from the last figure
min_threshold = 1.
max_threshold = 10.
threshold_step = 0.1

threshold_range = np.arange(min_threshold,max_threshold,threshold_step)

mdl1_recall = []
mdl1_precision = []
mdl1_f1 = []
mdl1_aucpr = []
mdl2_recall = []
mdl2_precision = []
mdl2_f1 = []
mdl2_aucpr = []

for thr in threshold_range:
    #print("-"*50)
    #print("threshold:",thr)
    #print("-"*50)
    
    Y_pred = [1 if e > thr else 0 for e in test_mse]
    mdl1_f1.append(f1_score(Y_test,Y_pred))
    mdl1_precision.append(precision_score(Y_test,Y_pred))
    mdl1_recall.append(recall_score(Y_test,Y_pred))
    mdl1_aucpr.append(average_precision_score(Y_test,Y_pred))
    #print("FIRST MODEL")
    #print("+"*30)
    #print("Autoencoder - Confusion matrix:")
    #print(confusion_matrix(Y_test,Y_pred))
    #print("F1 score:",f1_score(Y_test,Y_pred))
    #print("Precision score:",precision_score(Y_test,Y_pred))
    #print("Recall score:",recall_score(Y_test,Y_pred))
    #print("Average precision score:",average_precision_score(Y_test,Y_pred))

    Y_pred1 = [1 if e > thr else 0 for e in test_mse1]
    mdl2_f1.append(f1_score(Y_test,Y_pred1))
    mdl2_precision.append(precision_score(Y_test,Y_pred1))
    mdl2_recall.append(recall_score(Y_test,Y_pred1))
    mdl2_aucpr.append(average_precision_score(Y_test,Y_pred1))
    #print("+"*30)
    #print("SECOND MODEL")
    #print("+"*30)
    #print("Autoencoder - Confusion matrix:")
    #print(confusion_matrix(Y_test,Y_pred1))
    #print("F1 score:",f1_score(Y_test,Y_pred1))
    #print("Precision score:",precision_score(Y_test,Y_pred1))
    #print("Recall score:",recall_score(Y_test,Y_pred1))
    #print("Average precision score:",average_precision_score(Y_test,Y_pred1))
In [24]:
f, ax = plt.subplots(4,1,sharex=True)
ax[0].plot(threshold_range,mdl1_f1)
ax[0].set_ylabel('F1 SCORE')
ax[0].set_title('FIRST MODEL')
ax[1].plot(threshold_range,mdl1_precision)
ax[1].set_ylabel('PRECISION')
ax[2].plot(threshold_range,mdl1_recall)
ax[2].set_ylabel('RECALL')
ax[3].plot(threshold_range,mdl1_aucpr)
ax[3].set_ylabel('AUCPR')
ax[3].set_xlabel('Threshold')
Out[24]:
Text(0.5,0,'Threshold')
In [25]:
f, ax = plt.subplots(4,1,sharex=True)
ax[0].plot(threshold_range,mdl2_f1)
ax[0].set_ylabel('F1 SCORE')
ax[0].set_title('SECOND MODEL')
ax[1].plot(threshold_range,mdl2_precision)
ax[1].set_ylabel('PRECISION')
ax[2].plot(threshold_range,mdl2_recall)
ax[2].set_ylabel('RECALL')
ax[3].plot(threshold_range,mdl2_aucpr)
ax[3].set_ylabel('AUCPR')
ax[3].set_xlabel('Threshold')
Out[25]:
Text(0.5,0,'Threshold')

To choose a threshold for fraud detection, we can set a minimum requirement on the recall metric. Here we choose the threshold that maximizes the F1 score / precision while keeping a recall of at least 0.75.

In [26]:
# FIRST MODEL
mdl1_recall = np.array(mdl1_recall)
mdl1_f1 = np.array(mdl1_f1)
min_recall1 = 0.75

# Recall decreases as the threshold grows, so this mask keeps a prefix of
# threshold_range and the filtered argmax maps back to the same position
threshold_idx1 = np.argmax(mdl1_f1[mdl1_recall>min_recall1])

Y_pred = [1 if e > threshold_range[threshold_idx1] else 0 for e in test_mse]
print("FIRST MODEL")
print("-"*30)
print("Autoencoder - Confusion matrix:")
print(confusion_matrix(Y_test,Y_pred))
print("F1 score:",f1_score(Y_test,Y_pred))
print("Precision score:",precision_score(Y_test,Y_pred))
print("Recall score:",recall_score(Y_test,Y_pred))
print("Average precision score:",average_precision_score(Y_test,Y_pred))

# SECOND MODEL
mdl2_recall = np.array(mdl2_recall)
mdl2_f1 = np.array(mdl2_f1)
min_recall2 = 0.75

threshold_idx2 = np.argmax(mdl2_f1[mdl2_recall>min_recall2])

Y_pred1 = [1 if e > threshold_range[threshold_idx2] else 0 for e in test_mse1]
print("-"*30)
print("SECOND MODEL")
print("-"*30)
print("Autoencoder - Confusion matrix:")
print(confusion_matrix(Y_test,Y_pred1))
print("F1 score:",f1_score(Y_test,Y_pred1))
print("Precision score:",precision_score(Y_test,Y_pred1))
print("Recall score:",recall_score(Y_test,Y_pred1))
print("Average precision score:",average_precision_score(Y_test,Y_pred1))
FIRST MODEL
------------------------------
Autoencoder - Confusion matrix:
[[56239   624]
 [  108   384]]
F1 score: 0.512
Precision score: 0.38095238095238093
Recall score: 0.7804878048780488
Average precision score: 0.2992116969004603
------------------------------
SECOND MODEL
------------------------------
Autoencoder - Confusion matrix:
[[56175   688]
 [  110   382]]
F1 score: 0.4891165172855314
Precision score: 0.35700934579439253
Recall score: 0.7764227642276422
Average precision score: 0.2791080629877634
We can observe that the number of `False Positive` predictions with the Undercomplete Autoencoder model is high. However, the `recall` is good enough to detect fraudulent transactions in an unlabeled dataset.
This result could be used to reduce the number of transactions to analyse and label manually: new types of fraudulent transactions can be detected more easily within this reduced set.
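To make this triage argument concrete, the first model's confusion matrix printed above can be turned into a workload estimate: only transactions flagged as fraud need manual review. A minimal sketch using those printed numbers:

```python
# Confusion matrix of the first model, as printed above:
# [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 56239, 624, 108, 384

flagged = tp + fp                  # transactions sent to manual review
total = tn + fp + fn + tp          # all test transactions

review_fraction = flagged / total  # share of transactions to inspect
caught_fraction = tp / (tp + fn)   # share of frauds caught (= recall)

print(f"Review {review_fraction:.2%} of transactions, "
      f"catch {caught_fraction:.1%} of frauds")
# Review 1.76% of transactions, catch 78.0% of frauds
```

In other words, an analyst would only need to inspect roughly 2% of the transactions while still seeing about 78% of the frauds.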

Variational Autoencoder

In [27]:
batch_size = 2048
original_dim = 29    # = X_train.shape[1]
latent_dim = 7       # dimension of the latent space
intermediate_dim = 14
epsilon_std = 1.0

def sampling(args):
    '''
        Sample a point z from the latent space using the
        reparameterization trick: z = mean + std * epsilon.
        Source: blog.keras.io
    '''
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim),
                              mean=0.0, stddev=epsilon_std)
    return z_mean + K.exp(z_log_var/2)*epsilon

def get_variational_autoencoder_model():
    '''
        This function creates the Variational Autoencoder model, plus
        separate encoder and generator models.
        Architecture: 1 hidden layer + 1 output layer
        Source: blog.keras.io
    '''
    
    # ENCODER
    x = Input(shape=(original_dim,))
    h = Dense(intermediate_dim, activation='relu')(x)
    z_mean = Dense(latent_dim)(h)
    z_log_var = Dense(latent_dim)(h)
    
    # SAMPLING
    z = Lambda(sampling, output_shape=(latent_dim,))([z_mean,z_log_var])
    
    # DECODER
    decoder_h = Dense(intermediate_dim, activation='relu')
    decoder_mean = Dense(original_dim, activation='sigmoid')
    h_decoded = decoder_h(z)
    x_decoded_mean = decoder_mean(h_decoded)
    
    # end-to-end Autoencoder
    vae = Model(x,x_decoded_mean)
    
    # ENCODER: from inputs to latent space
    encoder = Model(x,z_mean)
    
    # DECODER: from latent space to reconstructed inputs
    decoder_input = Input(shape=(latent_dim,))
    _h_decoded = decoder_h(decoder_input)
    _x_decoded_mean = decoder_mean(_h_decoded)
    
    generator = Model(decoder_input,_x_decoded_mean)
    
    def vae_loss(x, x_decoded_mean):
        '''
        The loss function is the sum of a reconstruction loss and a KL
        divergence regularization term.
        '''
        # Reconstruction loss: binary crossentropy used as a per-feature
        # reconstruction error (the sigmoid decoder outputs values in
        # [0, 1]), not as a two-class classification loss
        xent_loss = original_dim*metrics.binary_crossentropy(x,x_decoded_mean)
        # KL divergence loss
        kl_loss = -0.5*K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=1)
    
        return K.mean(xent_loss + kl_loss)
    
    # Training model
    vae.compile(optimizer='rmsprop',loss=vae_loss)
    vae.summary()
    return vae
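The reparameterization trick used in `sampling` can be checked outside Keras with a minimal NumPy sketch (the latent statistics below are made-up values, just to illustrate the formula z = mean + exp(log_var / 2) * epsilon):

```python
import numpy as np

# Hypothetical latent statistics for one sample, latent_dim = 2
z_mean = np.array([0.0, 1.0])
z_log_var = np.array([0.0, np.log(4.0)])  # variances 1 and 4 -> stddevs 1 and 2

# A fixed epsilon instead of an N(0, 1) draw, to keep the example deterministic
epsilon = np.array([1.0, 1.0])

# Same formula as in sampling(): z = mean + std * epsilon
z = z_mean + np.exp(z_log_var / 2) * epsilon
print(z)  # [1. 3.]
```

Because the randomness is isolated in `epsilon`, gradients can flow through `z_mean` and `z_log_var` during backpropagation, which is the whole point of the trick.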
In [ ]:
nb_epochs = 500

#vae_model = get_variational_autoencoder_model()

########################
# ENCODER
x = Input(shape=(original_dim,))
h = Dense(intermediate_dim, activation='relu')(x)
z_mean = Dense(latent_dim)(h)
z_log_var = Dense(latent_dim)(h)
    
# SAMPLING
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean,z_log_var])
    
# DECODER
decoder_h = Dense(intermediate_dim, activation='relu')
decoder_mean = Dense(original_dim, activation='sigmoid')
h_decoded = decoder_h(z)
x_decoded_mean = decoder_mean(h_decoded)
    
# end-to-end Autoencoder
vae_model = Model(x,x_decoded_mean)
    
# ENCODER: from inputs to latent space
encoder = Model(x,z_mean)
    
# DECODER: from latent space to reconstructed inputs
decoder_input = Input(shape=(latent_dim,))
_h_decoded = decoder_h(decoder_input)
_x_decoded_mean = decoder_mean(_h_decoded)
    
generator = Model(decoder_input,_x_decoded_mean)
    
def vae_loss(x, x_decoded_mean):
    '''
        The loss function is the sum of a reconstruction loss and a KL
        divergence regularization term.
    '''
    # Reconstruction loss: binary crossentropy used as a per-feature
    # reconstruction error (the sigmoid decoder outputs values in
    # [0, 1]), not as a two-class classification loss
    xent_loss = original_dim*metrics.binary_crossentropy(x,x_decoded_mean)
    # KL divergence loss
    kl_loss = -0.5*K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=1)
    
    return K.mean(xent_loss + kl_loss)
    
# Training model
vae_model.compile(optimizer='rmsprop',loss=vae_loss)
vae_model.summary()


########################

# Defining checkpoints
vae_bestModelFile = 'vae_autoencoder.h5'
vae_checkpoint = ModelCheckpoint(filepath=vae_bestModelFile,verbose=1,
                                 monitor='val_loss',mode='min',
                                 save_best_only=True)
vae_earlystop = EarlyStopping(monitor='val_loss',patience=20,
                              verbose=1,mode='min')

vae_history = vae_model.fit(X_train, X_train, shuffle=True,
                            epochs=nb_epochs, batch_size=batch_size,
                            validation_data=(X_test,X_test), verbose=0,
                            callbacks=[vae_checkpoint,vae_earlystop])
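The `kl_loss` term in `vae_loss` above is the closed-form KL divergence between the approximate posterior $\mathcal{N}(\mu, \sigma^2)$, with $\log\sigma_j^2$ given by `z_log_var`, and the standard normal prior:

```latex
D_{KL}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\right)
  = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```

Each term matches the Keras expression `-0.5*K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=1)`, summed over the $d =$ `latent_dim` dimensions.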
In [32]:
#vae_model = load_model(vae_bestModelFile)
#X_pred_vae = vae_model.predict(X_test)
vae_model.load_weights(vae_bestModelFile)
In [35]:
X_train_pred_vae = vae_model.predict(X_train)
X_test_pred_vae = vae_model.predict(X_test)

train_mse_vae = np.mean(np.power(X_train - X_train_pred_vae, 2), axis =1)
test_mse_vae = np.mean(np.power(X_test - X_test_pred_vae, 2), axis =1)

f, ax = plt.subplots(3,1)
#ax[0].hist(mse_vae[(Y_test==0) & (mse_vae<10)],bins=20)
ax[0].hist(train_mse_vae,bins=20)
#ax[0].hist(mse_vae[(Y_test==0)],bins=20)
ax[0].set_yscale('log')
ax[0].set_ylabel('Nb transactions')
ax[0].set_title('Normal transactions')
ax[1].hist(test_mse_vae[Y_test==0],bins=20)
ax[1].set_yscale('log')
ax[1].set_ylabel('Nb transactions')
ax[1].set_title('Test - Normal transactions')
ax[2].hist(test_mse_vae[Y_test==1],bins=20)
ax[2].set_ylabel('Nb transactions')
ax[2].set_title('Test - Fraud transactions')
ax[2].set_xlabel('Mean Squared Error (MSE)')

print("Nb samples in test:",len(test_mse_vae))
#print("Nb samples with error MSE less than 10:",np.shape(mse[(Y_test==0) & (mse<10)]))
#print("Nb samples of Fraud transactions:",np.sum(Y_test))
print(np.max(test_mse_vae))
Nb samples in test: 57355
550.1220651365874
In [36]:
# Setting the threshold from the last figure
threshold = 4.5
Y_pred_vae = [1 if e > threshold else 0 for e in test_mse_vae]
print(confusion_matrix(Y_test,Y_pred_vae))
print(f1_score(Y_test,Y_pred_vae))
[[56040   823]
 [  129   363]]
0.432657926102503
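Instead of eyeballing the MSE histogram, the threshold could also be chosen with scikit-learn's `precision_recall_curve`, which evaluates precision and recall at every distinct score in one call. A minimal sketch (with made-up labels and scores standing in for `Y_test` and `test_mse_vae`) of picking the best-F1 threshold subject to a minimum recall:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and anomaly scores, standing in for Y_test / test_mse_vae
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# precision/recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so all three arrays line up
precision, recall = precision[:-1], recall[:-1]

min_recall = 0.75
# Mask out thresholds whose recall is too low, keeping indices aligned
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_idx = np.argmax(np.where(recall >= min_recall, f1, -np.inf))
best_threshold = thresholds[best_idx]

y_pred = (scores >= best_threshold).astype(int)
```

This replaces the manual `threshold_range` sweep used earlier in the notebook with a single vectorised computation over all candidate thresholds.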